1.  Introduction

The test-and-treatment problem, originally defined by D.W. Loveland, is a generalisation of the binary testing problem studied by many researchers (see [1][2][6][7][11]). This problem is of independent interest since it finds applications in medical diagnosis, systematic biology, machine fault location, laboratory analysis and many other fields. A parallel algorithm for this problem is presented which is implemented on the Boolean Vector Machine (BVM), a machine formed by connecting many tiny PEs into a cube-connected-cycle network. The PEs are so small that a machine with $2^{20}$ PEs is implementable using current VLSI technology, and even a $2^{30}$-PE machine is feasible. By handling the communication problem carefully we are able to transform the dynamic programming solution into the ASCEND/DESCEND scheme. This solution to the communication problem, together with a careful algorithm design for generating control bits, solves the PE allocation problem. As a result we are able to achieve $O(p/\log p)$ speedup on such a machine with only $3p/2$ connections among $p$ PEs.

* Work reported herein is partially supported by the Air Force under grant numbers AFOSR 81-0221 and AFOSR 83-0205.

It has been shown [3][8] that finding an optimal solution to the binary testing problem is in general NP-hard; that is, no polynomial-time algorithm on sequential machines is known. Since the test-and-treatment problem generalises the binary testing problem, the test-and-treatment problem is also NP-hard. With the development of VLSI technology, parallel machines with thousands and even millions of processing elements (PEs) will be available. It is now practically possible to speed up the computation considerably by trading a huge number of PEs for speed. Solutions to NP-complete problems on parallel machines have appeared in the literature [4][8]. Parallel algorithms are considered to be good if the speedup $S = T_1/T_p$ achieved is equal or close to $p$, where $T_i$ is the time complexity on a parallel machine with $i$ PEs. It is especially of practical interest when these algorithms can be implemented efficiently on "practical" parallel machines, since the PE allocation problem must be taken into account and communication between PEs must be handled carefully.
In this paper we present a parallel algorithm approach to the problem. The Boolean Vector Machine (BVM) [16] is the parallel computation model we have chosen, a model that uses the Cube-Connected-Cycle (CCC) [13] structure. In the BVM, each processing element (PE) is connected to three other PEs by a one-bit wide connection path. It has been estimated that a BVM with $2^{20}$ PEs is feasible. Such a huge parallel machine could be used to solve moderate-sized NP-complete or NP-hard problems. The time complexity and processor complexity of the TT algorithm on this machine model are respectively $O(kp(k+\log N))$* and $O(N 2^k)$, where $k$ is the size of the universe, which is a set of objects containing the malfunctioning one, $p$ is the precision required, and $N$ is the total number of tests and treatments available. This result represents a speedup of $O(p/\log p)$ with regard to the known sequential algorithm, which could be obtained by modifying the backward induction algorithm given by Garey [1]. The $\log p$ in the speedup accounts for the communication needed among the PEs. As can be shown by a simple fan-in argument, $\Omega(k+\log N)$ time is required for the communication among $O(N 2^k)$ PEs. Considering that each PE in our machine has only three links to other PEs, this $\log p$ factor is quite reasonable.

* Logarithms in this paper are to the base 2.


Of particular interest here is the fact that $O(p/\log p)$ speedup is shown for a machine so simple in structure that a number of PEs on the order of $2^{20}$ ($\approx 10^6$) is feasible. For $2^{30}$ PEs, approximately 15 elements (say, disease candidates) could be processed in parallel (assuming worst case possibilities) to find the best treatment for the true disease even if all possible tests and treatments were available (i.e. $N = O(2^k)$). A speedup of roughly $10^5$ could thus be realised over a sequential processing of a test-and-treatment problem with 15 candidates. (This allows for the parallelism of 64 bits that a sequential machine might possess.)


Our algorithm was designed to optimise performance for relatively few tests and treatments, e.g. $N = O(k^i)$ for fixed $i$. Other approaches are reasonable if $N = O(2^k)$ is commonly used. We note that a few more elements, e.g. 20, can be processed in parallel if $N = O(k^2)$, say.

The test-and-treatment problem requests the selection of a minimum-cost test and treatment procedure under an expected cost criterion. The problem arises whenever a fault (disease, system malfunction) must be treated. The classic example is medical diagnosis and treatment, but other applications are also important, such as computer system fault location and correction and logistical system breakdown correction. In general, the problem exists whenever a sizable population of complex objects (people, ships, computers) must be maintained at reasonable cost.

The problem specification consists of a universe $U = \{0,1,\ldots,k-1\}$ of $k$ objects, each with an associated weight $P_i$, and a set of tests and treatments $T_i$, $1 \le i \le N$, each with an associated cost. The $T_i$, $1 \le i \le m$, denote tests, and the $T_i$, $m < i \le N$, denote treatments. We assume that only one object is actually faulty, its identity is unknown, and each object $i$ has a priori likelihood $P_i$ of being the faulty object. Each test and treatment is specified by a subset of the universe; if the unknown object is in the test or treatment set then the test responds positively (it "is successful"), or the treatment is successful. If the test is successful, one eliminates the other objects from consideration (and if negative, one eliminates the test set of objects), while a successful treatment ends the procedure. A failed treatment means the processing must continue. A successful TT procedure must provide for each object to be treated; a TT problem specification is adequate if there exists a successful TT procedure. With each test and treatment $T_i$, a cost $t_i$ of executing that test or treatment is given with the problem specification.


From the above description we see that a TT procedure is a binary decision tree, with both test and treatment nodes. A typical TT procedure is given in Figure 1, where the single arc is used for both test outcomes (the positive outcome to the left by convention) and a treatment failure, and the double-line arc denotes a treatment set. (The double arc is for emphasis only, since every branch of a successful TT procedure must terminate in a treatment set.)


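[Figure 1: a binary TT procedure tree with test and treatment nodes; the tree layout is not recoverable here.]

$U = \{0,1,2,3,4,5\}$, $P_i = 1/6$ for all $i$,
$T_1 = \{\ldots,2\}$, $T_2 = \{0,4\}$, $T_3 = \{0,1\}$, $T_4 = \{0,1,3\}$,
$T_5 = \{4\}$, $T_6 = \{0,\ldots\}$, $T_7 = \{5\}$, $m = 3$,
$t_i = 1$ for all $i$.

Fig 1.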

The TT procedure Tree has an expected cost defined as follows:

$$\mathrm{Cost}(\mathit{Tree}) = \sum_{i \in U} (\text{cost of all tests and treatments encountered if } i \text{ is the faulty object}) \cdot P_i$$

The desired solution is the procedure which minimises this cost. Thus

$$\mathrm{Cost} = \min_{\text{all trees}} \mathrm{Cost}(\mathit{Tree}).$$
Rather than enumerate all the possible TT trees and take the minimum cost directly, we use the approach of dynamic programming and note that the optimal tree must apply the minimum cost action (test or treatment) to already optimal subtrees. The optimal subtrees are obtained by beginning with one-object trees and combining trees as just described. For each one-object set $S$ of objects we compute the cost $C(S)$ of the minimum TT tree as follows:

$$C(S) = \min_{i} \big( t_i \, p(S) \big)$$

where $p(S) = P_j$ for $S = \{j\}$, and the minimum ranges over the treatments $T_i$ that contain $j$. There is only a treatment component for one-object sets since the set cannot be split. (Note that we have not normalized the set of weights, so technically these are not TT problems themselves.) For an arbitrary non-singleton set $S$ of objects we compute the cost $C(S)$ as follows:

$$C(S) = \min\Big[ \min_{1 \le i \le m} \big( t_i \, p(S) + C(S \cap T_i) + C(S - T_i) \big),\; \min_{m < i \le N} \big( t_i \, p(S) + C(S - T_i) \big) \Big],$$

where $p(S) = \sum_{j \in S} P_j$.

This definition is from first principles: the value $t_i$ is charged to each object subject to that action, and the total weight of those objects to be charged is $p(S)$. For tests, one adds in the cost $C(S \cap T_i)$ of the set $S \cap T_i$ to which the test responds positively (the test set) plus the cost $C(S - T_i)$ of the set $S - T_i$ of objects not responding to the test. Treatments terminate action on the objects of $T_i$, $m < i \le N$ (i.e., treat them), so the only objects needing further action are the objects in $S - T_i$; we add in the cost $C(S - T_i)$. The essence of an argument by induction that $C(S)$ is correctly computed uses the assumption that $C(S \cap T_i)$ and $C(S - T_i)$ are the correct costs for the subtrees; we then note from the above description that the correct minimum is taken to compute $C(S)$.
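The recurrence for $C(S)$ can be sketched sequentially as follows. This is only an illustrative sketch and not the BVM algorithm of this paper: the function name `optimal_tt_cost`, the bitmask encoding of sets, the 0-based action indices (tests first), and the rule of skipping actions that make no progress on $S$ are all assumptions made for the example.

```python
from functools import lru_cache

def optimal_tt_cost(weights, actions, num_tests):
    """Expected cost C(U) of a minimum TT procedure (hypothetical helper).

    weights[j] is the weight P_j of object j.
    actions is a list of (cost t_i, set T_i); the first num_tests are tests,
    the remaining ones are treatments (an indexing assumed for this sketch).
    """
    INF = float("inf")
    k = len(weights)
    # Encode every set T_i as a bitmask over the k objects.
    masks = [(cost, sum(1 << j for j in members)) for cost, members in actions]

    def p(s):
        # Total weight p(S) of the objects in S.
        return sum(weights[j] for j in range(k) if (s >> j) & 1)

    @lru_cache(maxsize=None)
    def c(s):
        if s == 0:
            return 0.0
        best = INF
        for i, (cost, t) in enumerate(masks):
            inside, outside = s & t, s & ~t
            if inside == 0:
                continue  # the action does not touch S, so no progress
            if i < num_tests:
                if outside == 0:
                    continue  # a test must split S to be useful
                cand = cost * p(s) + c(inside) + c(outside)
            else:
                cand = cost * p(s) + c(outside)
            best = min(best, cand)
        return best

    return c((1 << k) - 1)
```

With two equally likely objects, one unit-cost test on {0} and unit-cost treatments for {0} and {1}, the cheapest procedure treats one object and then the other; a specification with no treatment covering some object yields an infinite cost, which corresponds to an inadequate specification.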

2.  The  Boolean  Vector  Machine 

The BVM is a CCC parallel machine. It is a typical ultracomputer [14]. Since the BVM communication network resembles the Benes permutation network, it can accomplish any permutation within $O(\log n)$ time if the control bits are precalculated [10][13]. Logically the BVM can be viewed as a bit array, shown in Fig 2.

Fig.  2.

Each row of the bits forms a register. Each column forms a PE. Let $r$ be a positive integer and $Q = 2^r$; there are in total $2^{r+Q}$ PEs, as required by a complete CCC network. The number of registers $L$ depends on the BVM implemented. Our BVM has $L = 256$ registers.

Each group of PEs $(2^r i,\; 2^r i + 1,\; \ldots,\; 2^r i + 2^r - 1)$, $0 \le i < 2^Q$, forms a cycle; thus the address of PE $2^r i + j$ can be represented alternatively as (i, j), with the first component being the cycle number and the second the address within the cycle. Within cycle i, PE (i, j) is connected only to its predecessor (i, (j+Q-1)%Q)* and its successor (i, (j+1)%Q). In addition each PE (i, j) is connected to its lateral neighbor (i ^ $2^j$, j); the cycles are thus connected together.
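The addressing scheme above can be made concrete with a small sketch. The function names `ccc_neighbors` and `linear_address` are hypothetical and chosen only for this illustration; the formulas are the ones stated in the text.

```python
def ccc_neighbors(i, j, r):
    """Neighbors of PE (i, j) in a CCC whose cycles have length Q = 2**r."""
    q = 1 << r
    return {
        "predecessor": (i, (j + q - 1) % q),
        "successor": (i, (j + 1) % q),
        # The lateral link flips bit j of the cycle number.
        "lateral": (i ^ (1 << j), j),
    }

def linear_address(i, j, r):
    """Linear PE address 2**r * i + j of PE (i, j)."""
    return (1 << r) * i + j

def total_pes(r):
    """Number of PEs in a complete CCC: 2**Q cycles of Q = 2**r PEs each."""
    q = 1 << r
    return q * (1 << q)
```

For r = 2 this gives Q = 4 and 64 PEs, the size used in the cycle-ID example later in the paper. Every PE has exactly three links, and following the lateral link twice returns to the starting PE.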

The BVM is a bit-oriented machine. Only Boolean function operations are allowed. Each of its instructions involves possibly registers A and B and at most one other register. An instruction has the form:

{A or R[j]}, B = f, g(F, D, B) {IF or NF} <set>;

Two assignment operations will be simultaneously performed by executing this instruction. The first assigns f(F, D, B) to either A or R[j], the second assigns g(F, D, B) to B. f and g are any Boolean functions of three arguments. F may be A or R[j]; D may be A.N or R[j].N. N denotes a neighbor PE of PE (c, p). It can be:

S: successor PE (c, (p+1)%Q);

P: predecessor PE (c, (p+Q-1)%Q);

L: lateral PE (c ^ $2^p$, p);

XS: even successor exchange PE (c, p ^ $2^0$);

XP: even predecessor exchange, XP=P if p is even;
XP=S if p is odd;

I: input one bit to PE (0, 0); PE ($2^Q-1$, Q-1) outputs one bit at the same time. All other PEs get bits from their predecessors except PEs (., 0), which get bits from PEs (.-1, Q-1).

* As in the C language, %, /, &, |, ^ are the modulo, integer division, and, or, and exclusive-or operations respectively.

The {IF or NF} <set> denotes the activate/deactivate set. <set> is a subset of {0, 1, ..., $2^r-1$}. IF <set> means all the PEs (i, j), $0 \le i < 2^Q$ and $j \in$ <set>, will be activated while the remaining PEs will be deactivated. The meaning of NF <set> is just the opposite. If the part {IF or NF} <set> is not present in the instruction, then all the PEs are activated.

There is a special register, register E, which is used as an enable/disable register. PE i will be enabled or disabled according to whether its bit of the E register is 1 or 0. The E register itself is always enabled.

The value of a PE will not be affected (except that of register E) if it is deactivated or disabled.

For further details of the BVM, the reader is referred to [15], [16].

3.  Hypercube Algorithm

A hypercube connection network has been suggested [12] as a network for connecting together an array of PEs. The hypercube connection network connects PE x to any PE whose address differs from x in exactly one bit. Thus a machine of $2^s$ PEs will have each PE connected to s PEs. Since the hypercube network seems to be more regular than the CCC, and each PE has more connections to other PEs, in many situations designing a hypercube algorithm is more convenient and straightforward than designing a CCC network algorithm. Unfortunately, with n PEs a hypercube network requires about $(n \log n)/2$ links. With a CCC connection only about $3n/2$ links are needed.
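The difference in wiring cost is easy to tabulate. The helpers below are hypothetical names introduced only for this comparison, and they assume n is a power of two (and, for the CCC, that every PE has three distinct links).

```python
def hypercube_links(n):
    """Links in an n-PE hypercube: each PE has log2(n) links, each shared by two PEs."""
    dims = n.bit_length() - 1  # log2(n) for a power of two
    return n * dims // 2

def ccc_links(n):
    """Links in an n-PE CCC: each PE has three links, each shared by two PEs."""
    return 3 * n // 2

for n in (64, 1 << 20):
    print(n, hypercube_links(n), ccc_links(n))
```

The gap grows with log n, which is why the CCC is attractive for machines with very many tiny PEs.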

An algorithm is in the ASCEND (DESCEND) [13] form if it consists of a sequence of basic operations on pairs of data, where the addresses of the pairs differ successively in bit 0, bit 1, ..., bit p-1 (bit p-1, bit p-2, ..., bit 0). Here and henceforth bits are counted from the least significant bit.

Preparata and Vuillemin showed in [13] that the ASCEND/DESCEND algorithms can be executed on the CCC network fairly efficiently. Precisely speaking, these hypercube network algorithms can be simulated on a CCC at a slowdown of a factor of 4 to 6, regardless of the network size. Thus designing an ASCEND/DESCEND algorithm for a hypercube, and transforming it into a CCC algorithm, seems to be a reasonable way of designing an efficient CCC algorithm.

In the CCC the links connecting the lateral PEs are called highsheaves. Highsheaves correspond to the high-order bit connections in the hypercube. The number of high-order bits is Q, the number of bits of the cycle number. The lowsheaves are virtual links in the CCC which correspond to the low-order bit connections in the hypercube. There are r bits for the lowsheaves. The lowsheaf connections in the CCC are formed by shuffling or unshuffling data inside cycles.

4.  Several  Important  BVM  Algorithms 

Several important BVM algorithms are presented here. These algorithms are useful in programming the BVM. The cycle-ID and the processor-ID are the most basic modules, which are used in almost all BVM algorithms. Broadcasting and propagation algorithms handle the typical dataflow patterns on the BVM. These algorithms or their adapted versions are used to construct the test-and-treatment algorithm. As we present these algorithms, we will also discuss how to generate the control bits for them. Although these control bits can be precalculated, it will save the precalculation time and the runtime storage when they are generated on the fly.

1.  Cycle-ID 

There are two ways of viewing the cycle-ID. The first is that PE (i, j) holds the j-th bit of cycle number i. Thus the bits held by all the PEs in cycle i form the cycle number i. In the CCC network the lateral links correspond to the high-order bit links in the hypercube. The alternative way of viewing the cycle-ID is that the bit of the cycle-ID that PE i holds is 1 iff PE i is at the 1-end of its lateral link. For the CCC with n=64 PEs, the cycle-ID is shown in Fig. 3.

         cycle
         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
PE 0     0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
PE 1     0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
PE 2     0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
PE 3     0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1

Fig 3.

The digit at cycle i and PE j represents the bit held by PE j in cycle i.
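The pattern of Fig. 3 follows directly from the first definition of the cycle-ID. The sketch below computes the pattern on a sequential machine; it is not the BVM instruction sequence, and the function name `cycle_id_pattern` is an assumption of this example.

```python
def cycle_id_pattern(r):
    """Rows indexed by in-cycle position j, columns by cycle number i.

    Entry [j][i] is bit j of the cycle number i, i.e. the bit that
    PE (i, j) holds in the cycle-ID.
    """
    q = 1 << r          # cycle length Q
    cycles = 1 << q     # number of cycles, 2**Q
    return [[(i >> j) & 1 for i in range(cycles)] for j in range(q)]

for j, row in enumerate(cycle_id_pattern(2)):
    print("PE", j, "".join(str(b) for b in row))
```

For r = 2 the four rows reproduce the rows PE 0 to PE 3 of Fig. 3.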

In [15] the following algorithm is given for generating the cycle-ID. The time complexity is $O(\log n)$.

cycle-ID()
{
    A=1;
    A=A.I;    /* input a bit 0 to PE 0 */
    for(i=1; i<Q; i++) {
        A=A & A.L;
        A=A.I;
    }
    A=A.P;
    for(i=1; i<Q; i++) {
        A=A & A.L;
        A=A.P;
    }
}


2.  Processor-ID


The  processor-ID  is  defined  as  the  pattern  of  addresses  such  that  each  PE  holds  its  own 
address.  For  8  PEs  the  processor-ID  is  shown  in  Fig.  4. 

PE        0  1  2  3  4  5  6  7

R[i]      0  0  0  0  1  1  1  1
R[i+1]    0  0  1  1  0  0  1  1
R[i+2]    0  1  0  1  0  1  0  1

Fig 4.
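The register layout of Fig. 4 can be reproduced with a short sketch: register row 0 holds the most significant address bit of every PE and the last row holds the least significant bit. The function name `processor_id_pattern` is hypothetical, and the code computes the pattern directly rather than by BVM instructions.

```python
def processor_id_pattern(num_bits):
    """Rows are registers R[i], R[i+1], ...; columns are PEs 0 .. 2**num_bits - 1.

    Row 0 holds the most significant address bit of each PE.
    """
    n = 1 << num_bits
    return [
        [(pe >> (num_bits - 1 - row)) & 1 for pe in range(n)]
        for row in range(num_bits)
    ]

for row in processor_id_pattern(3):
    print(" ".join(str(b) for b in row))
```

Reading a column from top to bottom gives the binary address of that PE.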

The  algorithm  for  generating  the  processor-ID  is  as  follows: 


Processor-ID()
{
1.  R[S]=cycle-ID();
2.  for(i=1; i<Q; i++) {
        R[S+i]=R[S+i-1];
        R[S+i]=R[S+i].S;
    }
3.  for(i=0; i<Q/2; i++)
        for(j=0; j<Q; j++) {
            if(i%2==0) R[S+j]=R[S+j].XP IF {e | i<e<Q-i};
            else       R[S+j]=R[S+j].XS IF {e | i<e<Q-i};
        }
4.  for(i=0; i<Q; i++)
        for(j=0; j<r; j++)
            R[S-j-1]=bit(r-j-1, i) IF { i };
}

Recall that R[i].S, R[i].XS, R[i].XP are the R[i]'s successor, even successor exchange and even predecessor exchange respectively. The function bit(p, q) is the p-th bit of q. The timing is $O(\log^2 n)$. Here we give an example of the execution of the algorithm. In Fig. 5, (i) is the pattern created after the execution of the statement labeled i in the algorithm.

[Fig 5. An instance of n=64: the patterns (1) to (4) created by the Processor-ID algorithm; the bit patterns are not recoverable here.]

3.  Broadcasting

The broadcasting algorithm is used to broadcast data from one PE to all the other PEs. In [8] Nassimi and Sahni studied data broadcasting on SIMD machines; here we only consider broadcasting on the BVM. This algorithm and the algorithms for propagation will be presented as ASCEND hypercube algorithms, for the transformed BVM algorithms look much more complicated and provide little insight into the algorithms. Let the total number of PEs be $2^m$. Let j<i> denote the i-th bit of j and j#i denote the binary number which is obtained by complementing the i-th bit of j. The following algorithm broadcasts the content of PE 0 to all the PEs.

Broadcasting()
{
    SENDER=0;
    SENDER(PE[0])=1;
    for(i=0; i<m; i++)
        forall PE[j]
            if 1-END(PE[j], i) && SENDER(PE[j#i]) {
                PE[j] = PE[j#i];
                SENDER(PE[j]) = SENDER(PE[j#i]);
            }
}

SENDER is a register. SENDER(PE[0]) is the bit belonging to both register SENDER and PE 0. 1-END(PE[j], i) is true when the i-th bit of j is 1. For a 16-PE array, the broadcasting process is shown in Fig. 6.


1.  0000 -> 0001.
2.  0000 -> 0010,  0001 -> 0011.
3.  0000 -> 0100,  0001 -> 0101,
    0010 -> 0110,  0011 -> 0111.
4.  0000 -> 1000,  0001 -> 1001,
    0010 -> 1010,  0011 -> 1011,
    0100 -> 1100,  0101 -> 1101,
    0110 -> 1110,  0111 -> 1111.

Fig. 6.

The algorithm is an ASCEND algorithm, thus O(m) time is enough to execute the transformed algorithm on a $2^m$-PE CCC computer.

To run this algorithm on the BVM, the control bits must be considered. The approach we choose works as follows. First an arbitrary register SENDER in the BVM is chosen. Set every bit of SENDER to 0 by using one instruction. Then input a bit 1 to the bit belonging to both PE[0] and register SENDER. Afterwards this bit will be broadcast in the instruction PE[j]=PE[j#i], and the content of register SENDER will be used to identify the sender. If the number of bits to be broadcast is k, then the algorithm takes O(km) time.
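A sequential simulation of the hypercube form of the broadcast may help in following the SENDER mechanism. This is a sketch under simplifying assumptions: each PE holds one Python value instead of BVM register bits, all PEs of one step are updated from a snapshot to mimic simultaneous execution, and the function name `broadcast` and the trace format are invented for this example.

```python
def broadcast(values, m):
    """Simulate the ASCEND broadcast of PE 0's content on 2**m PEs.

    Returns the final PE contents and a trace of (step, source, target).
    """
    n = 1 << m
    pe = list(values)
    sender = [False] * n
    sender[0] = True              # only PE 0 is a sender initially
    trace = []
    for i in range(m):
        new_pe, new_sender = pe[:], sender[:]
        for j in range(n):
            partner = j ^ (1 << i)            # j#i
            if (j >> i) & 1 and sender[partner]:
                new_pe[j] = pe[partner]
                new_sender[j] = sender[partner]
                trace.append((i + 1, partner, j))
        pe, sender = new_pe, new_sender
    return pe, trace

contents, trace = broadcast(["x"] + [None] * 15, 4)
for step, src, dst in trace:
    print(step, format(src, "04b"), "->", format(dst, "04b"))
```

The printed transfers are the ones listed in Fig. 6: one transfer in the first step, then two, four and eight.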

4.  Propagation 


Propagation refers to propagating data from one set of PEs to another set. We consider two kinds of propagation here.

The i-PE group is the set of PEs whose addresses have exactly i 1's. The first kind of propagation considered here propagates data from the N-PE group to the (N+1)-PE group such that PE j in the (N+1)-PE group receives data from PE k in the N-PE group iff when k<i>=1 then j<i>=1.

Example:

N=2. For the 16-PE array, PE 0111 receives data from PE 0110, 0101 and 0011. #

The algorithm is shown below:

Propagation1()
{
    for(i=0; i<m; i++)
        forall PE[j]
            if SENDER(PE[j#i]) && 1-END(PE[j], i)
                PE[j] = COMBINE(PE[j], PE[j#i]);
}

The time complexity is O(m). If the propagation needed requires data to flow through the 0-PE group, the 1-PE group, ..., the m-PE group, then the algorithm must be used m times and the timing will be $O(m^2)$.

It might seem possible to enable all the PEs in the (N+1)-PE group, thus allowing the propagation to be done more naturally. However in many situations in which the algorithm is used, data are propagated from the 0-PE group to the 1-PE group to the 2-PE group, ..., to the N-PE group, but initially no PE knows which group it belongs to except the PE in the 0-PE group, which is PE 0. Thus a PE in the (N+1)-PE group will know that it is in the (N+1)-PE group from the fact that the sender is in the N-PE group and it itself is at the 1-end of the communication link. Certainly one can generate the processor-ID and count the number of 1's in it to decide to which group each PE belongs, but that involves more overhead.
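The first kind of propagation can be simulated on a sequential machine as follows. In this sketch COMBINE is taken to be set union so that the sources of each datum stay visible; that choice, the function name `propagate_one_level`, and the use of Python sets for PE contents are assumptions of the example, not part of the BVM algorithm.

```python
def propagate_one_level(data, senders, m):
    """Simulate Propagation1 on 2**m PEs.

    data[j] is the set held by PE j, and senders[j] is True for the PEs
    of the N-PE group. Senders stay fixed during the whole pass.
    """
    n = 1 << m
    pe = [set(d) for d in data]
    for i in range(m):
        for j in range(n):
            partner = j ^ (1 << i)                 # j#i
            if senders[partner] and (j >> i) & 1:  # receiver is at the 1-end
                pe[j] |= data[partner]             # COMBINE as set union
    return pe

m = 4
senders = [bin(k).count("1") == 2 for k in range(1 << m)]
data = [{k} if senders[k] else set() for k in range(1 << m)]
result = propagate_one_level(data, senders, m)
print(sorted(format(k, "04b") for k in result[0b0111]))
```

A receiver always has one more 1 in its address than its sender, so with the 2-PE group as senders only the 3-PE group receives data, and PE 0111 collects the data of PEs 0011, 0101 and 0110 as in the example above.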

Another kind of propagation requires data to propagate from the N-PE group to the M-PE group, where without loss of generality we assume N < M. PE k in the M-PE group will get the data from PE j in the N-PE group such that if j<i>=1 then k<i>=1.

Example:

M=3, N=1. For the 16-PE array, PE 0111 will get data from PE 0001, 0010, 0100. #

Though the first propagation algorithm may be used for this purpose, it is relatively slow. The alternative way to do it is shown in the following example.

Example:

m=4, data will propagate from the 1-PE group to the 4-PE group.

Initially data are in PEs 0001, 0010, 0100, 1000.

1. 0001 does not transmit. 0010 -> 0011, 0100 -> 0101, 1000 -> 1001.

2. 0001 -> 0011, 0011 does not transmit, 0101 -> 0111, 1001 -> 1011. 0011 combines the data coming from 0001 and the data inside itself.

3. 0011 -> 0111, 0111 does not transmit, 1011 -> 1111. 0111 combines the data coming from 0011 and the data inside itself.

4. 0111 -> 1111, 1111 does not transmit; it combines the data coming from 0111 and the data inside itself. #

Now we shall discuss the generation of the control bits. We use one bit in each PE to indicate if it is a legal sender. Only PEs with this bit set have the right to send. In the sending process the sender will not discard this bit. The receiver acquiring this bit will become a legal sender. If the receiver gets two bits from two senders, it must combine the data and the control bits (using a logical or operation). Notice that this is different from the scheme in the propagation of the first kind, since here the receiver becomes a sender immediately after it gets the data. In the propagation of the first kind, a sender or a receiver remains the same until all the PEs in the (i+1)-PE group have been connected to the PEs in the i-PE group.

Initially PEs in the N-PE group are identified as senders. Only PEs in the higher PE groups can get data from lower PE groups. To control the direction of the dataflow on the BVM the cycle-ID should be used.

The algorithm is as follows:

Propagation2()
{
    for(i=0; i<m; i++)
        forall PE[j]
            if SENDER(PE[j#i]) && 1-END(PE[j], i)
            {
                PE[j] = COMBINE(PE[j], PE[j#i]);
                SENDER(PE[j]) = SENDER(PE[j#i]);
            }
}

The timing is O(m)=O(log n).
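The second kind of propagation differs from the first only in that a receiver becomes a sender at once. A sequential simulation, again with COMBINE taken to be set union and with the hypothetical name `propagate_upward`, could look like this:

```python
def propagate_upward(data, senders, m):
    """Simulate Propagation2 on 2**m PEs.

    A PE that receives data from a legal sender becomes a legal sender
    itself, so data climb through several PE groups in a single pass.
    """
    n = 1 << m
    pe = [set(d) for d in data]
    sender = list(senders)
    for i in range(m):
        new_pe = [set(d) for d in pe]
        new_sender = sender[:]
        for j in range(n):
            partner = j ^ (1 << i)                # j#i
            if sender[partner] and (j >> i) & 1:
                new_pe[j] |= pe[partner]          # COMBINE as set union
                new_sender[j] = True              # receiver inherits the sender bit
        pe, sender = new_pe, new_sender
    return pe, sender

m = 4
senders = [bin(k).count("1") == 1 for k in range(1 << m)]
data = [{k} if senders[k] else set() for k in range(1 << m)]
result, final_senders = propagate_upward(data, senders, m)
print(sorted(format(k, "04b") for k in result[0b1111]))
```

Starting from the 1-PE group, PE 1111 ends up with the data of all four initial senders after m steps, and PE 0111 with the data of 0001, 0010 and 0100, which agrees with the definition of the second kind of propagation.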


5.  A Parallel Algorithm for the Test-and-Treatment Problem

In actual computation we will use an array M[S, i] to calculate C(S): $M[S,i] = t_i\,p(S) + C(S \cap T_i) + C(S - T_i)$, $0 \le i \le m$, and $M[S,i] = t_i\,p(S) + C(S - T_i)$, $m < i < N$; therefore

$$C(S) = \min \{\, M[S,i] \mid 0 \le i < N \,\}.$$

The parallel algorithm is given below:


TT(S, T, P, t)
{
    foreach i: 0<=i<N do {
        TP[S,i] = t_i * p(S), if #S > 0;
        M[S,i] = 0, if #S = 0;
                 INF, otherwise.
    }
    for(j=1; j<=#U; j++) {
        foreach (S, i): U ⊇ S and #S=j and 0<=i<N do {
            M[S,i] = M[S-T_i, i];
            M[S,i] += TP[S,i];
            if(i <= m) M[S,i] += M[S ∩ T_i, i];
        }
        foreach (S, i): U ⊇ S and #S=j and 0<=i<N do
            M[S,i] = min(M[S,x] | 0<=x<N);
    }
}

#S in the algorithm denotes the size of set S.


Note that the conditions $S \cap T_i \neq \emptyset$ and $S - T_i \neq \emptyset$ will be taken into account automatically. If $S \cap T_i = \emptyset$ and $T_i$ is a test, then $S - T_i = S$. Since M[S,i] is initialised to infinity (INF), we have:

$$M[S,i] = t_i\,p(S) + C(S \cap T_i) + C(S - T_i) = t_i\,p(S) + C(S) = \mathrm{INF}.$$

So it will be excluded in the minimisation, or C(S) will be INF, depending on whether or not there is an M[S,j] such that M[S,j] < INF. The same reasoning applies if $T_i$ is a treatment or if $S - T_i = \emptyset$ (then $S \cap T_i = S$).

The j-loop is needed because when we calculate M[S,i] we should have all the $C(S \cap T_i)$ and $C(S - T_i)$ available.
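The level-by-level structure of the algorithm, including the way the INF initialisation rules out useless actions, can be traced with a sequential sketch. The name `tt_table`, the bitmask encoding of sets, and the convention that the first `num_tests` actions are the tests are assumptions of this example; each inner loop stands for what all PEs (S, i) of one level would do in parallel.

```python
def tt_table(weights, actions, num_tests):
    """Fill M[S][i] level by level, as in the j-loop, and return C(U).

    actions is a list of (cost t_i, set T_i); the first num_tests entries
    are tests (an indexing assumed for this sketch).
    """
    INF = float("inf")
    k, n_act = len(weights), len(actions)
    costs = [cost for cost, _ in actions]
    masks = [sum(1 << j for j in members) for _, members in actions]
    size = 1 << k
    p = [sum(weights[j] for j in range(k) if (s >> j) & 1) for s in range(size)]
    # M[S][i] = 0 for the empty set and INF otherwise.
    M = [[0.0 if s == 0 else INF] * n_act for s in range(size)]
    for level in range(1, k + 1):
        for s in range(size):
            if bin(s).count("1") != level:
                continue
            for i in range(n_act):
                # If T_i misses S then S - T_i = S and the INF entry is read back.
                value = M[s & ~masks[i]][i] + costs[i] * p[s]
                if i < num_tests:
                    value += M[s & masks[i]][i]
                M[s][i] = value
            # Minimisation: every entry of row S receives C(S).
            M[s] = [min(M[s])] * n_act
    return M[size - 1][0]

print(tt_table([0.5, 0.5], [(1, {0}), (1, {0}), (1, {1})], 1))
```

No explicit check for empty intersections is needed: an action that cannot make progress on S reads an entry of row S that is still INF, exactly as argued above.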




6.  The ASCEND/DESCEND Algorithm

Can our parallel TT algorithm be transformed into the ASCEND/DESCEND form? Observe that if we assign a PE to each (S, i) pair then M[S,i] and TP[S,i] are placed in different sections of the same PE, thus the instruction M[S,i] += TP[S,i] can be executed in parallel by all PEs at once. However the instructions M[S,i] = M[S-T_i, i], M[S,i] += M[S ∩ T_i, i] and the minimization require communication between different PEs.

The minimization part of the algorithm can be transformed into the following ASCEND form:

for(t=0; t<log N; t++)
    foreach (S,i): U ⊇ S and #S=j and 0<=i<N do
        M[S,i]=min(M[S,i], M[S,i#t]);

where i#t is the binary number obtained by complementing the t-th bit (from right) of i.

Suppose $N = 2^p$ (otherwise we let $T_N = T_{N+1} = \cdots = T_{2^p-1} = U$, and all of them will be treatments with cost INF). After executing M[S,i]=min(M[S,i], M[S,i#0]),

M[S,j] = M[S,j+1] = min(M[S,j], M[S,j+1]), where j = 0, 2, ..., N-2.

Assume that after executing M[S,i]=min(M[S,i], M[S,i#(q-1)]):

$$M[S,j] = M[S,j+1] = \cdots = M[S,j+2^q-1] = \min(M[S,j], M[S,j+1], \ldots, M[S,j+2^q-1])$$

for every j that is a multiple of $2^q$.

After executing M[S,i]=min(M[S,i], M[S,i#q]), M[S,j] = min(M[S,j], M[S,j+$2^q$]) for every j such that $j/2^q$ is even and $0 \le j < N$. By induction we know now:

$$M[S,j] = M[S,j+1] = \cdots = M[S,j+2^{q+1}-1] = \min(M[S,j], M[S,j+1], \ldots, M[S,j+2^{q+1}-1])$$

for every j that is a multiple of $2^{q+1}$.

Let q = p-1; all the PEs associated with set S will get the minimum value. Fig. 7 shows an example with p=3.
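The induction can be checked on a small row of values. The sketch below simulates the ASCEND minimisation for one fixed set S; the function name `ascend_min` and the sample values are invented for the illustration, and the row length is assumed to be a power of two.

```python
def ascend_min(row):
    """ASCEND minimisation over one row M[S, 0..N-1], N a power of two.

    In step t every position i is paired with i#t (bit t of i complemented);
    after log N steps every position holds the minimum of the row.
    """
    n = len(row)
    m = list(row)
    history = [m[:]]
    for t in range(n.bit_length() - 1):
        # All positions are updated from the same snapshot, as in a parallel step.
        m = [min(m[i], m[i ^ (1 << t)]) for i in range(n)]
        history.append(m[:])
    return m, history

final, history = ascend_min([5, 3, 8, 1, 9, 2, 7, 6])
for t, state in enumerate(history):
    print(t, state)
```

After step t the row consists of aligned blocks of length $2^{t+1}$ with equal entries, which is the inductive statement used in the argument.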

[Fig. 7. Table of M[S,i] against i: the initial value and the values after steps t = 1, 2, 3; the table entries are not recoverable here.]

Now consider the instruction:

foreach (S,i) M[S,i]=M[S-T_i, i].

Can this operation be transformed into the ASCEND/DESCEND form?

Let us begin by expanding it into its component operations as follows:

foreach (S,i) do {
    R[S,i]=M[S-T_i, i];
    M[S,i]=R[S,i];
}

This emphasises that, if a unique PE holds each element of M[S,i], some communication between PEs is needed. The variable R is introduced to handle the possibility that the PEs involved are not in fact neighbors, so that each item of information must be passed along a chain of PEs before it reaches its destination. For simplicity, we also assume that R[S,i] is located somewhere in the memory of the same PE that also holds M[S,i], for each distinct pair (S,i).

Consider now the operation R[S,i]=M[S-T_i, i]. For each S and i, this operation is well-defined, since the set $S - T_i$ is uniquely determined by S and $T_i$, thus ensuring that each PE which receives information during this activity receives information from only one PE. However, the converse is not true: a given PE may send its information to several others, as in the example shown in Fig. 8.

U = {0,1,2}, T_i = {0,1}.

If S is:      S-T_i is:
∅             ∅
{0}           ∅
{1}           ∅
{1,0}         ∅
{2}           {2}
{2,0}         {2}
{2,1}         {2}
{2,1,0}       {2}

Fig. 8.

Thus M[∅,i] will send its value to R[∅,i], R[{0},i], R[{1},i] and R[{0,1},i], and M[{2},i] will send its value to the other four PEs. In general, M[S-T_i, i] must be broadcast to R[(S-T_i) ∪ V, i], for each V such that V ⊆ T_i. The following loop accomplishes the required broadcast, for all i, 0 ≤ i < N:

R[S,i]=M[S,i];
for(e=0; e<k; e++)
    foreach (S,i): U ⊇ S and 0<=i<N and e ∈ S∩T_i do
        R[S,i]=R[S-{e},i];
M[S,i]=R[S,i];
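The broadcast loop above can be tried out on a small case. The following is a minimal sequential sketch for one fixed i, assuming subsets of U are encoded as bitmasks and that every `foreach` step is synchronous (all PEs read before any PE writes, modelled here by a snapshot). The function name `broadcast_minus` and the sample values are illustrative only and do not come from the paper.

```python
def broadcast_minus(M, T, k):
    """Return R with R[S] == M[S - T] for every subset S of {0..k-1}.

    Subsets are bitmasks; M is a list of length 2**k for one fixed i.
    """
    R = list(M)                           # R[S,i] = M[S,i]
    for e in range(k):
        bit = 1 << e
        old = list(R)                     # snapshot: synchronous foreach
        for S in range(1 << k):
            if S & T & bit:               # e in S ∩ T_i
                R[S] = old[S & ~bit]      # R[S,i] = R[S-{e},i]
    return R

k = 3
T = 0b011                                 # T = {0,1}, as in Fig. 8
M = [100 + S for S in range(1 << k)]      # arbitrary distinct values
R = broadcast_minus(M, T, k)
```

With U = {0,1,2} and T = {0,1}, every set containing element 2 ends up holding the value that started at {2}, and every other set holds the value that started at the empty set.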

Continuing the previous example, Fig. 8 shows the value of Q, just after the e-th iteration of the loop.

Let $I_t = \{j \in U \mid j \le t\}$. Then just before e takes on value t, $R[(S-T_i)\cup(S\cap T_i\cap I_{t-1}),\,i]$ holds $M[S-T_i,\,i]$. This is true initially, before e takes on value 0, since $I_{-1}=\emptyset$ and $R[S-T_i,\,i]$ holds $M[S-T_i,\,i]$. Assuming the statement was true before e was set to t, the loop body, executed with e set to t, causes certain elements of array R to change in value. Specifically, whenever $t \in S\cap T_i$, R[S,i] is replaced by R[S-{t},i]. Suppose $t \notin S\cap T_i$; then $S\cap T_i\cap I_t = S\cap T_i\cap I_{t-1}$, so $R[(S-T_i)\cup(S\cap T_i\cap I_t),\,i]$ is unchanged and held $M[S-T_i,\,i]$ prior to this execution of the loop body, by the inductive assumption. Suppose instead that $t \in S\cap T_i$. Then $R[(S-T_i)\cup(S\cap T_i\cap I_t),\,i]$ is changed in this iteration of the loop. Since $I_t = I_{t-1}\cup\{t\}$, we have $S\cap T_i\cap I_t = (S\cap T_i\cap I_{t-1})\cup\{t\}$. Because $t \in S\cap T_i$, $t \in T_i$, thus $t \notin S-T_i$, and $((S-T_i)\cup(S\cap T_i\cap I_t)) - \{t\} = (S-T_i)\cup(S\cap T_i\cap I_{t-1})$. By the inductive assumption, $R[(S-T_i)\cup(S\cap T_i\cap I_{t-1}),\,i] = M[S-T_i,\,i]$. Hence, after the assignment, $R[(S-T_i)\cup(S\cap T_i\cap I_t),\,i] = M[S-T_i,\,i]$ also.

Finally, after all iterations of the loop on e, $I_{k-1}=U$ and $S\cap T_i\cap I_{k-1}=S\cap T_i$. Thus, for all S and i, R[S,i]=M[S-T_i,i].
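The inductive claim can also be checked mechanically on small instances. The sketch below assumes bitmask-encoded subsets and a synchronous `foreach`. Before each iteration t it tests that the entry indexed by the union of S-T and (S∩T restricted to elements below t) holds M[S-T]. The name `invariant_holds` is hypothetical.

```python
def invariant_holds(M, T, k):
    """Check the loop invariant before every iteration and the final result."""
    R = list(M)
    for t in range(k):
        below_t = (1 << t) - 1                  # elements j < t
        for S in range(1 << k):
            X = (S & ~T) | (S & T & below_t)    # (S-T) ∪ (S∩T∩{j<t})
            if R[X] != M[S & ~T]:
                return False
        bit = 1 << t
        old = list(R)                           # synchronous step
        for S in range(1 << k):
            if S & T & bit:                     # t in S ∩ T
                R[S] = old[S & ~bit]
    return all(R[S] == M[S & ~T] for S in range(1 << k))

k = 3
values = [10 * S + 1 for S in range(1 << k)]    # distinct sample values
ok = all(invariant_holds(values, T, k) for T in range(1 << k))
```

Running the check over every possible T for k = 3 exercises both cases of the induction step (t inside and outside S∩T).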

Similarly, the operation

if (i<=m) M[S,i]+=M[S∩T_i,i];

can be resolved into:

Q[S,i]=M[S∩T_i,i];
if(i<=m) M[S,i]+=Q[S,i];

Again, concentrating on the operation Q[S,i]=M[S∩T_i,i], we observe that the value $M[S\cap T_i,\,i]$ can be broadcast to each PE holding $Q[V\cup(S\cap T_i),\,i]$, for every V such that $S-T_i \supseteq V$. By symmetry with the previous argument, the following loop performs this operation:

Q[S,i]=M[S,i];
for(e=0; e<k; e++)
    foreach (S,i): U ⊇ S and 0<=i<N and e ∈ S-T_i do
        Q[S,i]=Q[S-{e},i];
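The symmetric loop can be sketched in the same way. This is an illustrative simulation for one fixed i, again assuming bitmask-encoded subsets and a synchronous `foreach`; `broadcast_intersection` is a hypothetical name.

```python
def broadcast_intersection(M, T, k):
    """Return Q with Q[S] == M[S ∩ T] for every subset S of {0..k-1}."""
    Q = list(M)                           # Q[S,i] = M[S,i]
    for e in range(k):
        bit = 1 << e
        old = list(Q)                     # synchronous step: read old, write new
        for S in range(1 << k):
            if S & ~T & bit:              # e in S - T_i
                Q[S] = old[S & ~bit]      # Q[S,i] = Q[S-{e},i]
    return Q

k = 3
T = 0b011                                 # T = {0,1}
M = [100 + S for S in range(1 << k)]
Q = broadcast_intersection(M, T, k)
```

Here each pass strips from S one element that lies outside T, so after k passes every PE holds the value that started at S∩T.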

The complete algorithm now appears as:

TT()
{
    foreach (S,i): U ⊇ S and 0<=i<N do {
        if(#S>0) TP[S,i]=t_i*p(S);
        M[∅,i]=0;
        if(#S>0) M[S,i]=INF;
    }

    for(j=1; j<=k; j++) {
        foreach (S,i): P1(S,i) {
            Q[S,i]=R[S,i]=M[S,i];
        }
        for(e=0; e<k; e++) {
            foreach (S,i): P1(S,i) and e ∈ S∩T_i {
                R[S,i]=R[S-{e},i];
            }
            foreach (S,i): P1(S,i) and e ∈ S-T_i {
                Q[S,i]=Q[S-{e},i];
            }
        }
        foreach (S,i): P(S,i,j) {
            M[S,i]=R[S,i];
            M[S,i]+=TP[S,i];
            if(i<=m) M[S,i]+=Q[S,i];
        }
        for(t=0; t<log N; t++)
            foreach (S,i): P(S,i,j) {
                M[S,i]=min(M[S,i],M[S,i#t]);
            }
    }
}

where P1(S,i) ≡ U ⊇ S and 0<=i<N, and P(S,i,j) ≡ P1(S,i) and #S=j.
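To see the whole procedure at work, the following sketch simulates TT() sequentially and compares it with a direct recursive dynamic program. Several points are assumptions rather than statements from the paper: subsets are bitmasks, each `foreach` is synchronous, `i#t` is read as i with bit t complemented, N is a power of two, indices i <= m are tests and the rest are treatments, and the per-action cost is a list `cost` multiplied by the total weight p(S). The instance at the end is made up for illustration.

```python
import math
from functools import lru_cache

INF = math.inf

def tt_parallel(k, T, cost, m, p):
    """Sequential simulation of TT(); returns the minimum cost for every S."""
    N, full = len(T), 1 << k
    pS = [sum(p[a] for a in range(k) if S >> a & 1) for S in range(full)]
    TP = [[cost[i] * pS[S] for i in range(N)] for S in range(full)]
    M = [[0 if S == 0 else INF for _ in range(N)] for S in range(full)]
    for j in range(1, k + 1):
        R = [row[:] for row in M]
        Q = [row[:] for row in M]
        for e in range(k):
            bit = 1 << e
            oldR = [row[:] for row in R]          # snapshots model synchrony
            oldQ = [row[:] for row in Q]
            for S in range(full):
                for i in range(N):
                    if S & T[i] & bit:            # e in S ∩ T_i
                        R[S][i] = oldR[S & ~bit][i]
                    if S & ~T[i] & bit:           # e in S - T_i
                        Q[S][i] = oldQ[S & ~bit][i]
        level = [S for S in range(full) if bin(S).count("1") == j]
        for S in level:
            for i in range(N):
                M[S][i] = R[S][i] + TP[S][i] + (Q[S][i] if i <= m else 0)
        for t in range(N.bit_length() - 1):       # log2(N) exchange steps
            for S in level:
                old = M[S][:]
                for i in range(N):
                    M[S][i] = min(old[i], old[i ^ (1 << t)])
    return [M[S][0] for S in range(full)]

def tt_sequential(k, T, cost, m, p):
    """Reference recursion over subsets, used only as a cross-check."""
    @lru_cache(maxsize=None)
    def C(S):
        if S == 0:
            return 0
        w = sum(p[a] for a in range(k) if S >> a & 1)
        best = INF
        for i in range(len(T)):
            rest, hit = S & ~T[i], S & T[i]
            if hit == 0 or (i <= m and rest == 0):
                continue                          # action makes no progress on S
            best = min(best, cost[i] * w + C(rest) + (C(hit) if i <= m else 0))
        return best
    return [C(S) for S in range(1 << k)]

# Hypothetical instance: tests 0 and 1, treatments 2 and 3.
k, m = 3, 1
T = [0b011, 0b101, 0b001, 0b110]
cost = [1, 1, 2, 3]
p = [1, 2, 3]
par = tt_parallel(k, T, cost, m, p)
seq = tt_sequential(k, T, cost, m, p)
```

The simulation costs far more than the recursion on a serial machine; its only purpose is to show that the level-by-level schedule with the two broadcasts and the minimum exchange computes the same table.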


7.  Implementation  Schemes 

On the BVM each PE will stand for a pair (i, j), where i and j are binary numbers and ij, the concatenation of i and j, is the address of the PE. |i|, the number of bits in i, is k. The component i denotes a subset S of U: a ∈ S iff the a-th bit of i is 1. j is the index of a test or a treatment.

Example: k=4, PE 011011 stands for (0110, 11); the set S denoted by i is {1, 2}, and the index number of the test or the treatment is 3.

The activate/deactivate mechanism is very convenient; unfortunately, it can only provide limited masking capabilities. The enable register can provide any kind of enable/disable pattern, but generating these control bits is difficult. Here we show how these control bits are generated and how the algorithm is implemented by using the algorithms introduced in section 5.

The predicates $e \in S\cap T_i$ and $e \in S-T_i$ can be implemented by using the processor-ID. The processor-ID bits will let each PE know the set S it represents. $T_i$ should be input to the BVM. The most interesting part of the algorithm is the loop indexed by the variable e. The technique used here is the one we introduced in the propagation algorithm. Note that by imposing the conditions $e \in S\cap T_i$ and $e \in S-T_i$ the result becomes $R[S,i]=R[S-T_i,\,i]$ and $Q[S,i]=Q[S\cap T_i,\,i]$. The dataflow is controlled by the predicates $e \in S\cap T_i$ and $e \in S-T_i$. Because each iteration of the loop indexed by j will increment the size of the sets #S by 1, the predicate P(S,i,j) can be implemented by using the propagation of the first kind. The cycle-ID will be used in the propagation algorithm.
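The addressing convention and the two dataflow predicates of this section can be illustrated with plain bit operations. This is a sketch only: it assumes bits are numbered from the least significant end, and the helper names `decode` and `predicates` are hypothetical, not BVM instructions.

```python
def decode(address: str, k: int):
    """Split a PE address into (S, index); the first k bits encode the set S."""
    i = int(address[:k], 2)
    j = int(address[k:], 2)
    S = {a for a in range(k) if i >> a & 1}   # a in S iff a-th bit of i is 1
    return S, j

def predicates(i: int, T: int, e: int):
    """Return (e in S∩T, e in S-T), computed from the set bits i of the ID."""
    bit = 1 << e
    return bool(i & T & bit), bool(i & ~T & bit)

S, index = decode("011011", 4)                # the example from the text
in_both, in_diff = predicates(0b0110, 0b0011, 1)
```

For the address 011011 with k=4 this yields the set {1, 2} and index 3, matching the example in the text. With T = {0,1}, element 1 lies in S∩T and element 2 lies in S-T.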


8.  Conclusion 


Many NP-complete problems can be solved on the BVM fairly efficiently, as we illustrate using the test-and-treatment problem. Indeed, the test-and-treatment problem itself is of real interest as it has many important applications. A parallel algorithm for this problem is presented and implemented on the Boolean Vector Machine. The communication problem and the PE allocation problem have been solved so that a speedup of $O(p/\log p)$ is achieved. Algorithms used in constructing the test-and-treatment algorithm have also been presented and reveal in some degree the different methods of programming serial and parallel machines.